On Approximating String Selection Problems with Outliers

Authors

  • Christina Boucher
  • Gad M. Landau
  • Avivit Levy
  • David Pritchard
  • Oren Weimann
Abstract

Many problems in bioinformatics are about finding strings that approximately represent a collection of given strings. We look at more general problems where some input strings can be classified as outliers. The Close to Most Strings problem is, given a set S of same-length strings, and a parameter d, find a string x that maximizes the number of “non-outliers” within Hamming distance d of x. We prove this problem has no PTAS unless ZPP = NP, correcting a decade-old mistake. The Most Strings with Few Bad Columns problem is to find a maximum-size subset of input strings so that the number of non-identical positions is at most k; we show it has no PTAS unless P = NP. We also observe Closest to k Strings has no EPTAS unless W[1] = FPT. In sum, outliers help model problems associated with using biological data, but we show the problem of finding an approximate solution is computationally difficult.
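
To make the two problem statements above concrete, here is a minimal brute-force sketch in Python. It is not the authors' method and says nothing about approximability; it simply enumerates candidates to illustrate the objectives. The function names, the explicit `alphabet` parameter, and the sample strings are all assumptions made for illustration, and the running time is exponential, so it is only usable on tiny inputs.

```python
from itertools import combinations, product


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def close_to_most_strings(strings, d, alphabet):
    """Exhaustive search (hypothetical helper, not from the paper).

    Tries every string x over `alphabet` of the input length and returns
    (x, count), where count is the number of input strings within
    Hamming distance d of x. Strings outside that radius are the outliers.
    """
    length = len(strings[0])
    best_center, best_count = None, -1
    for candidate in product(alphabet, repeat=length):
        center = "".join(candidate)
        count = sum(hamming(center, s) <= d for s in strings)
        if count > best_count:
            best_center, best_count = center, count
    return best_center, best_count


def bad_columns(subset):
    """Number of positions where the strings in `subset` are not all identical."""
    return sum(len(set(column)) > 1 for column in zip(*subset))


def most_strings_few_bad_columns(strings, k):
    """Exhaustive search (hypothetical helper, not from the paper).

    Returns a maximum-size subset of `strings` with at most k bad columns,
    trying larger subsets first.
    """
    for size in range(len(strings), 0, -1):
        for subset in combinations(strings, size):
            if bad_columns(subset) <= k:
                return list(subset)
    return []


if __name__ == "__main__":
    sample = ["aab", "aab", "abb", "bba"]  # made-up input
    print(close_to_most_strings(sample, d=1, alphabet="ab"))
    print(most_strings_few_bad_columns(sample, k=1))
```

On the made-up sample, the string "bba" is at Hamming distance 3 from "aab", so no center of radius 1 can cover all four strings; one string must be treated as an outlier.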

Similar resources

Robust high-dimensional semiparametric regression using optimized differencing method applied to the vitamin B2 production data

Background and purpose: By evolving science, knowledge, and technology, we deal with high-dimensional data in which the number of predictors may considerably exceed the sample size. The main problems with high-dimensional data are the estimation of the coefficients and interpretation. For high-dimension problems, classical methods are not reliable because of a large number of predictor variable...

Outlier Detection for DNA Fragment Assembly

A major impediment in the development of efficient full genome sequencing is the large portion of erroneous reads produced by sequencing platforms. Error correction is the computational process that attempts to identify and correct these mistakes. Several classical stringology problems, including the Consensus String problem, are used to model error correction. However, a significant shortcomin...

Identification of outliers types in multivariate time series using genetic algorithm

Multivariate time series data are often modeled using the vector autoregressive moving average (VARMA) model. But the presence of outliers can violate the stationarity assumption and may lead to wrong modeling, biased estimation of parameters, and inaccurate prediction. Thus, detection of these points and how to deal properly with them, especially in relation to modeling and parameter estimation of VARMA m...

Outlier Detection by Boosting Regression Trees

A procedure for detecting outliers in regression problems is proposed. It is based on information provided by boosting regression trees. The key idea is to select the most frequently resampled observation along the boosting iterations and reiterate after removing it. The selection criterion is based on Tchebychev’s inequality applied to the maximum over the boosting iterations of ...

Journal title:
  • Theor. Comput. Sci.

Volume 498  Issue 

Pages  -

Publication date 2012